Loading in packages, removing objects from the environment and setting the working directory

if (!require("pacman")) install.packages("pacman"); library(pacman)
pacman::p_load(tidyverse, caTools, lubridate,
               ggthemes, janitor, corrplot, GGally, psych)

rm(list=ls())
setwd("/Users/Jesse/R/MedInsuranceProject/")

Reading in the data

df <- read_csv("/Users/Jesse/R/MedInsuranceProject/insurance.csv")

Let’s take a took at the data

head(df)
## # A tibble: 6 x 7
##     age sex      bmi children smoker region    charges
##   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.
str(df)
## spec_tbl_df [1,338 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age     : num [1:1338] 19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr [1:1338] "female" "male" "male" "male" ...
##  $ bmi     : num [1:1338] 27.9 33.8 33 22.7 28.9 ...
##  $ children: num [1:1338] 0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr [1:1338] "yes" "no" "no" "no" ...
##  $ region  : chr [1:1338] "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num [1:1338] 16885 1726 4449 21984 3867 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_character(),
##   ..   bmi = col_double(),
##   ..   children = col_double(),
##   ..   smoker = col_character(),
##   ..   region = col_character(),
##   ..   charges = col_double()
##   .. )

Let’s change the character columns to factors

df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)

Finally, let’s check for any missing values

any(is.na(df))
## [1] FALSE

No missing data! Let’s move on to some exploratory data analysis

EDA

summary(df)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

Let’s take a look at a histogram of the insurance charges

df %>% ggplot(aes(charges)) + geom_histogram(aes(y=stat(density)), fill = "light blue", color = "white", bins = 30) + 
    ggtitle("Distribution of Charges") + theme_bw() + theme(plot.title = element_text(hjust = .5)) +
    geom_density(col="dark blue")

Let’s make the distribution closer to normal

df %>% ggplot(aes(log10(charges))) + geom_histogram(aes(y=stat(density)), fill = "light blue", color = "white", bins = 30) + 
    ggtitle("Distribution of Charges") + theme_bw() + theme(plot.title = element_text(hjust = .5)) + 
    geom_density(col = "dark blue")

Now let’s look at charges by region

df %>% ggplot(aes(region, charges)) + geom_boxplot(fill=c(3,4,6,7)) + 
    theme_bw() + ggtitle("Medical Insurance Charges by Region")

It looks like charges don’t differ much across regions, but the highest medical charges are in the Southwest

Let’s check out charges by smoker status

df %>% ggplot(aes(smoker, charges)) + geom_boxplot(fill=c(3,2)) + 
    theme_bw() + ggtitle("Medical Insurance Charges by Smoker Status")

As you would expect, smokers spend a lot more than non-smokers when it comes to medical insurance expenses

Next, let’s look at sex

df %>% ggplot(aes(sex, charges)) + geom_boxplot(fill=c(5,7)) + 
    theme_bw() + ggtitle("Medical Insurance Charges by Sex")

It appears sex doesn’t seem to affect medical expenses very much

Moving on to number of children

df %>% ggplot(aes(as.factor(children), charges)) + geom_boxplot(fill=c(2:7)) + 
    theme_bw() + ggtitle("Medical Insurance Charges by Number of Children")

Interestingly, medical expenses seem to increase with the number of children until you have five

Let’s check out obesity now. First we need to create a new column for obese, which is greater than 30 BMI

df <- df %>%
    mutate(obese = as.factor(if_else(bmi >= 30, "yes", "no")))

df %>% ggplot(aes(obese, charges)) + geom_boxplot(fill=c(6,4)) + 
    theme_bw() + ggtitle("Medical Charges by Obesity status (BMI 30+)")

As we can see, the average medical expenses for an obese individual is about $5000 more than non-obese people, while the median expenses betweent the group is similar

Lastly, let’s look at age, but also with smoker status

df %>% ggplot(aes(age, charges)) + geom_point(aes(color = smoker)) +
    theme_bw() + ggtitle("Medical Charges by Age and Smoker Status") + scale_color_manual(values=c("red", "blue"))

Age and medical expenses don’t appear to be linear, but with the smoker status you can see the relationship between smoking and higher medical expenses

Alright, let’s move on to looking at correlation between variables

Correlation

num_cols <- sapply(df,is.numeric) # using only numeric features
ggpairs(df[,num_cols]) # matrix of plots

pairs.panels(df[,num_cols]) # matrix of plots using pairs.panels

We can see that age and charges have the highest correlation among the numeric variables. It’s also interesting to note that none of our numeric values is highly correlated with each other, so multicollinearity wouldn’t be a problem.

Now let’s look at the correlation matrix for numeric features a different way

cor_mx <- cor(df[,num_cols]) # correlation matrix
cor_mx
##                age       bmi   children    charges
## age      1.0000000 0.1092719 0.04246900 0.29900819
## bmi      0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges  0.2990082 0.1983410 0.06799823 1.00000000
corrplot(cor_mx, method = "color", title = "Correlation Matrix for Numeric Features", 
         mar=c(0,0,1,0))

Let’s move on and look at correlation between all features. But first we need to create dummy variables for sex and smoker

df_dummy <- df %>%
    mutate(sex_dummy = if_else(sex == "male", 1, 0)) %>%
    mutate(smoker_dummy = if_else(smoker == "yes", 1, 0))

df_dummy <- df_dummy %>% 
    select(1,3,4,7,9,10)

# correlation all features
ggpairs(df_dummy)

pairs.panels(df_dummy)

dummy_cor_mx <- cor(df_dummy)
corrplot(dummy_cor_mx, method = "color", title = "Correlation Matrix for All Features", 
         mar=c(0,0,1,0))

We can see that smoking and medical expense are highly correlated.

Let’s move on to building a model using all of the original features, except using obese instead of BMI because BMI is essentially the same thing as obese

Building a model

set.seed(214)
sample <- sample.split(df$charges, SplitRatio = 0.70)

# Training Data
train = subset(df, sample == TRUE)

# Testing Data
test = subset(df, sample == FALSE)

med_model <- lm(charges ~ .-bmi, data = train)
summary(med_model)
## 
## Call:
## lm(formula = charges ~ . - bmi, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12378.6  -3632.7   -411.7   1184.5  28190.5 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4148.16     760.41  -5.455 6.28e-08 ***
## age               269.45      14.26  18.890  < 2e-16 ***
## sexmale            57.44     402.38   0.143  0.88652    
## children          502.32     164.85   3.047  0.00238 ** 
## smokeryes       22867.71     504.00  45.373  < 2e-16 ***
## regionnorthwest  -587.99     580.23  -1.013  0.31115    
## regionsoutheast  -498.98     571.72  -0.873  0.38302    
## regionsouthwest -1270.60     580.84  -2.188  0.02895 *  
## obeseyes         3999.25     409.43   9.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6109 on 927 degrees of freedom
## Multiple R-squared:  0.7304, Adjusted R-squared:  0.7281 
## F-statistic:   314 on 8 and 927 DF,  p-value: < 2.2e-16
plot(med_model)

Just using the original variables we get a fairly good R-squared of 0.7486, which implies that about 75% of the variation in charges can be explained by the independent variables. Also, all variables except for region and sex, are statistically significant predictors of medical expense (p < .05)

Let’s attempt to improve the model

First, let’s remove region and sex, as they were not statistically significant

med_model2 <- lm(charges ~ . -bmi -sex -region, data = train)
summary(med_model2)
## 
## Call:
## lm(formula = charges ~ . - bmi - sex - region, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12644.8  -3631.6   -284.8   1183.8  28264.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4711.97     646.07  -7.293 6.46e-13 ***
## age           269.60      14.26  18.900  < 2e-16 ***
## children      490.02     164.61   2.977  0.00299 ** 
## smokeryes   22932.73     500.57  45.813  < 2e-16 ***
## obeseyes     3975.47     402.32   9.881  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6112 on 931 degrees of freedom
## Multiple R-squared:  0.729,  Adjusted R-squared:  0.7278 
## F-statistic: 626.1 on 4 and 931 DF,  p-value: < 2.2e-16
plot(med_model2)

This model about the same, but slightly worse, with an R-squared of 0.7476, still implying about 75% of the variation in charges can be explained by the independent variables. Let’s try a model with an interaction between obesity and smoker status

med_model3 <- lm(charges ~ obese*smoker + age + children, data = train)
summary(med_model3)
## 
## Call:
## lm(formula = charges ~ obese * smoker + age + children, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19072.0  -2025.0  -1395.7   -604.1  24448.9 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2834.59     503.27  -5.632 2.36e-08 ***
## obeseyes              54.04     346.65   0.156    0.876    
## smokeryes          13077.17     548.52  23.841  < 2e-16 ***
## age                  272.33      10.99  24.779  < 2e-16 ***
## children             609.37     126.91   4.802 1.83e-06 ***
## obeseyes:smokeryes 19481.61     771.01  25.268  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4709 on 930 degrees of freedom
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.8385 
## F-statistic: 971.6 on 5 and 930 DF,  p-value: < 2.2e-16
plot(med_model3)

This third model is much better than the first two, with an R-squared of .8503, implying about 85% of the variation in charges can be explained by the independent variables. Our Residual vs Fitted plot appears a little better, although the Normal Q-Q plot still indicates some problems with our fit.

We can interpret the model as follow:
  • For an obese non-smoker, we can expect $128.03 higher charges on average
  • For a non-obese smoker, we can expect $13,389.05 higher charges on average
  • For each year increase in age, we can expect a $265.98 increase in average charges
  • For each child a person has, we can expect a $514,10 increase in average charges
  • For a person who is an obese smoker, we can expect $128.03 + $13,389.05 + $19,740.49 = $33,257.57 higher charges on average

Finally, let’s make predictions

test$predicted <- predict(med_model3, newdata = test)
test %>%
  ggplot() + geom_point(aes(predicted, charges, color = smoker, shape = obese)) +
  geom_abline(color = "red") + ggtitle("Prediction vs. Real Values")

It looks like our model is a pretty solid fit for our test data and our results are pretty accurate.

#calculating residuals 
test$residuals <- test$charges - test$predicted

#plot residuals 
test %>%
  ggplot() + geom_pointrange(aes(predicted, residuals, ymin = 0, ymax = residuals, color = smoker, shape = obese)) +
  geom_hline(yintercept = 0) + ggtitle("Residuals vs. Fitted Values")

Overall, the residuals look pretty good, although there are several high residuals that could be caused by factors not included in the original data set.